Iterative Few-shot Semantic Segmentation from Image Label Text
Few-shot semantic segmentation aims to learn to segment unseen class objects
with the guidance of only a few support images. Most previous methods rely on
the pixel-level label of support images. In this paper, we focus on a more
challenging setting, in which only the image-level labels are available. We
propose a general framework that first generates coarse masks with the help of
the powerful vision-language model CLIP, and then iteratively and mutually
refines the mask predictions of support and query images. Extensive experiments
on the PASCAL-5^i and COCO-20^i datasets demonstrate that our method not only
outperforms state-of-the-art weakly supervised approaches by a significant
margin, but also achieves results comparable to or better than recent supervised
methods. Moreover, our method exhibits excellent generalization to in-the-wild
images and uncommon classes. Code will be available at
https://github.com/Whileherham/IMR-HSNet. Comment: IJCAI 2022
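A minimal, hypothetical sketch of how an image-level label and CLIP could yield a coarse mask by scoring sliding-window crops against a text prompt (the window size, prompt template, and thresholding are illustrative assumptions, not the paper's exact pipeline):

```python
# Hedged sketch: coarse foreground mask for a class named only by an image-level
# label, via CLIP similarity of sliding-window crops. Not the paper's method.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def coarse_mask(image: Image.Image, class_name: str, win=96, stride=48, thresh=0.5):
    W, H = image.size
    text = clip.tokenize([f"a photo of a {class_name}"]).to(device)
    with torch.no_grad():
        t_feat = model.encode_text(text)
        t_feat = t_feat / t_feat.norm(dim=-1, keepdim=True)

    heat = torch.zeros(H, W)
    count = torch.zeros(H, W)
    for y in range(0, max(H - win, 1), stride):
        for x in range(0, max(W - win, 1), stride):
            crop = preprocess(image.crop((x, y, x + win, y + win))).unsqueeze(0).to(device)
            with torch.no_grad():
                v = model.encode_image(crop)
                v = v / v.norm(dim=-1, keepdim=True)
                score = (v @ t_feat.T).item()            # cosine similarity to the prompt
            heat[y:y + win, x:x + win] += score
            count[y:y + win, x:x + win] += 1
    heat = heat / count.clamp(min=1)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)  # normalize to [0, 1]
    return (heat > thresh).float()                        # coarse binary mask
```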
Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation
Unsupervised Domain Adaptation (UDA) aims to adapt the model trained on the
labeled source domain to an unlabeled target domain. In this paper, we present
Prototypical Contrast Adaptation (ProCA), a simple and efficient contrastive
learning method for unsupervised domain adaptive semantic segmentation.
Previous domain adaptation methods merely consider the alignment of
intra-class representational distributions across domains, while the
inter-class structural relationship is insufficiently explored; as a result,
the aligned representations on the target domain may not be as easily
discriminated as those on the source domain. Instead, ProCA incorporates
inter-class information into class-wise prototypes, and adopts the
class-centered distribution alignment for adaptation. By considering the same
class prototypes as positives and other class prototypes as negatives to
achieve class-centered distribution alignment, ProCA achieves state-of-the-art
performance on classical domain adaptation tasks, i.e., GTA5 → Cityscapes and
SYNTHIA → Cityscapes. Code is available at
https://github.com/jiangzhengkai/ProCA.
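A hedged sketch of the class-centered contrastive objective described above: each target-domain feature is pulled toward the prototype of its (pseudo-labeled) class and pushed away from all other class prototypes. Variable names and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototypical_contrast_loss(feats, pseudo_labels, prototypes, tau=0.1):
    """feats: (N, D) target-domain pixel/region features.
    pseudo_labels: (N,) class indices assigned to those features.
    prototypes: (C, D) class-wise prototypes (e.g., running means of source features).
    The same-class prototype acts as the positive, all other prototypes as negatives."""
    feats = F.normalize(feats, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = feats @ prototypes.T / tau            # (N, C) similarity to every prototype
    return F.cross_entropy(logits, pseudo_labels)  # InfoNCE-style: positive = own class
```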
Align, Perturb and Decouple: Toward Better Leverage of Difference Information for RSI Change Detection
Change detection is a widely adopted technique in remote sensing imagery (RSI)
analysis for discovering long-term geomorphic evolution. To highlight
areas of semantic change, previous efforts mostly pay attention to learning
representative feature descriptors of a single image, while the difference
information is either modeled with simple difference operations or implicitly
embedded via feature interactions. Nevertheless, such difference modeling can
be noisy since it suffers from non-semantic changes and lacks explicit guidance
from image content or context. In this paper, we revisit the importance of
feature difference for change detection in RSI, and propose a series of
operations to fully exploit the difference information: Alignment, Perturbation
and Decoupling (APD). Firstly, alignment leverages contextual similarity to
compensate for the non-semantic difference in feature space. Next, a difference
module trained with semantic-wise perturbation is adopted to learn more
generalized change estimators, which reversely bootstraps feature extraction
and prediction. Finally, a decoupled dual-decoder structure is designed to
predict semantic changes in both content-aware and content-agnostic manners.
Extensive experiments are conducted on benchmarks of LEVIR-CD, WHU-CD and
DSIFN-CD, demonstrating that our proposed operations bring significant improvements
and achieve competitive results under similar comparative conditions. Code is
available at https://github.com/wangsp1999/CD-Research/tree/main/openAPD. Comment: To appear in IJCAI 2023
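A hedged sketch of the alignment-before-differencing idea: for each location in one temporal feature map, contextually similar features from the other image are aggregated before taking the difference, partially compensating non-semantic offsets. The affinity formulation here is an assumption, not the released implementation.

```python
import torch
import torch.nn.functional as F

def aligned_difference(feat_t1, feat_t2, tau=0.07):
    """feat_t1, feat_t2: (B, C, H, W) bi-temporal feature maps.
    Aggregate contextually similar t1 features for every t2 location, then difference."""
    B, C, H, W = feat_t1.shape
    f1 = F.normalize(feat_t1.flatten(2), dim=1)                       # (B, C, HW)
    f2 = F.normalize(feat_t2.flatten(2), dim=1)
    affinity = torch.softmax(f2.transpose(1, 2) @ f1 / tau, dim=-1)   # (B, HW, HW)
    f1_aligned = affinity @ feat_t1.flatten(2).transpose(1, 2)        # (B, HW, C)
    f1_aligned = f1_aligned.transpose(1, 2).view(B, C, H, W)
    return torch.abs(feat_t2 - f1_aligned)                            # aligned difference features
```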
Self-supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban Scenes
Robust autonomous driving requires agents to accurately identify unexpected
areas in urban scenes. To this end, some critical issues remain open: how to
design an advisable metric to measure anomalies, and how to properly generate
training samples of anomaly data? Previous efforts usually resort to
uncertainty estimation and sample synthesis from classification tasks, which
ignore context information and sometimes require auxiliary datasets with
fine-grained annotations. In contrast, in this paper, we exploit the strong
context-dependent nature of the segmentation task and design an energy-guided
self-supervised framework for anomaly segmentation, which optimizes an anomaly
head by maximizing the likelihood of self-generated anomaly pixels. To this
end, we design two estimators for anomaly likelihood estimation: one is a
simple task-agnostic binary estimator, and the other depicts anomaly likelihood
as the residual of a task-oriented energy model. Based on the proposed estimators,
we further equip our framework with a likelihood-guided mask refinement
process to extract informative anomaly pixels for model training. We conduct
extensive experiments on the challenging Fishyscapes and Road Anomaly benchmarks,
demonstrating that, without any auxiliary data or synthetic models, our method
still achieves performance competitive with other state-of-the-art schemes.
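A minimal sketch of a free-energy anomaly score computed from segmentation logits, one plausible instantiation of the task-oriented energy model mentioned above (the paper's residual formulation may differ):

```python
import torch

def energy_anomaly_score(logits, temperature=1.0):
    """logits: (B, C, H, W) per-pixel class logits from the segmentation head.
    Free energy E = -T * logsumexp(logits / T); high energy (little evidence for
    any known class) is read as higher anomaly likelihood."""
    return -temperature * torch.logsumexp(logits / temperature, dim=1)  # (B, H, W)
```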
Learning Global-aware Kernel for Image Harmonization
Image harmonization aims to solve the visual inconsistency problem in
composited images by adaptively adjusting the foreground pixels with the
background as references. Existing methods employ local color transformation or
region matching between the foreground and background, which neglects the powerful
proximity prior and treats the foreground and background independently as whole
parts for harmonization. As a result, they still show limited performance
across varied foreground objects and scenes. To address this issue, we propose
a novel Global-aware Kernel Network (GKNet) to harmonize local regions with
comprehensive consideration of long-distance background references.
Specifically, GKNet includes two parts, i.e., harmony kernel prediction and
harmony kernel modulation branches. The former includes a Long-distance
Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction
Blocks (KPB) to predict multi-level harmony kernels by fusing global
information with local features. To achieve this goal, a novel Selective
Correlation Fusion (SCF) module is proposed to better select relevant
long-distance background references for local harmonization. The latter employs
the predicted kernels to harmonize foreground regions with both local and
global awareness. Abundant experiments demonstrate the superiority of our
method for image harmonization over state-of-the-art methods, e.g., achieving
39.53 dB PSNR, which surpasses the best counterpart by +0.78 dB, and
decreasing fMSE/MSE by 11.5%/6.7% compared with the
SoTA method. Code will be available at
https://github.com/XintianShen/GKNet. Comment: 10 pages, 10 figures
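A hedged sketch of the kernel-modulation step: every spatial location is filtered with its own predicted kernel, so local harmonization can be driven by globally informed kernels. Shapes and the softmax normalization are illustrative assumptions, not GKNet's exact design.

```python
import torch
import torch.nn.functional as F

def apply_predicted_kernels(feat, kernels, k=3):
    """feat:    (B, C, H, W) local features of the composite image.
    kernels: (B, k*k, H, W) per-pixel kernels predicted from fused global/local cues.
    Each location is convolved with its own k x k kernel (per-pixel dynamic filtering)."""
    B, C, H, W = feat.shape
    patches = F.unfold(feat, kernel_size=k, padding=k // 2)       # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H * W)
    weights = kernels.view(B, 1, k * k, H * W).softmax(dim=2)     # normalize each kernel
    out = (patches * weights).sum(dim=2)                          # (B, C, H*W)
    return out.view(B, C, H, W)
```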
Stroke-based Neural Painting and Stylization with Dynamically Predicted Painting Region
Stroke-based rendering aims to recreate an image with a set of strokes. Most
existing methods render complex images using a uniform-block-dividing
strategy, which leads to boundary inconsistency artifacts. To solve the
problem, we propose Compositional Neural Painter, a novel stroke-based
rendering framework which dynamically predicts the next painting region based
on the current canvas, instead of dividing the image plane uniformly into
painting regions. We start from an empty canvas and divide the painting process
into several steps. At each step, a compositor network trained with a phasic RL
strategy first predicts the next painting region, then a painter network
trained with a WGAN discriminator predicts stroke parameters, and a stroke
renderer paints the strokes onto the painting region of the current canvas.
Moreover, we extend our method to stroke-based style transfer with a novel
differentiable distance transform loss, which helps preserve the structure of
the input image during stroke-based stylization. Extensive experiments show our
model outperforms the existing models in both stroke-based neural painting and
stroke-based stylization. Code is available at
https://github.com/sjtuplayer/Compositional_Neural_Painter. Comment: ACM MM 2023
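A hedged, simplified sketch of the step-wise painting loop described above; the compositor/painter/renderer interfaces are illustrative assumptions, not the released code's API.

```python
import torch
from typing import Callable, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1) painting region on the canvas

def paint(
    target: torch.Tensor,                                   # (3, H, W) target image
    compositor: Callable[[torch.Tensor, torch.Tensor], Region],
    painter: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
    renderer: Callable[[torch.Tensor, Region, torch.Tensor], torch.Tensor],
    num_steps: int = 8,
) -> torch.Tensor:
    canvas = torch.zeros_like(target)                       # start from an empty canvas
    for _ in range(num_steps):
        # RL-trained compositor predicts the next painting region from the current canvas
        x0, y0, x1, y1 = region = compositor(canvas, target)
        # painter (trained with a WGAN discriminator) predicts stroke parameters for that region
        strokes = painter(canvas[:, y0:y1, x0:x1], target[:, y0:y1, x0:x1])
        # stroke renderer paints the strokes onto that region of the current canvas
        canvas = renderer(canvas, region, strokes)
    return canvas
```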
TEINet: Towards an Efficient Architecture for Video Recognition
Efficiency is an important issue in designing video architectures for action
recognition. 3D CNNs have witnessed remarkable progress in action recognition
from videos. However, compared with their 2D counterparts, 3D convolutions
often introduce a large amount of parameters and cause high computational cost.
To relieve this problem, we propose an efficient temporal module, termed as
Temporal Enhancement-and-Interaction (TEI Module), which could be plugged into
the existing 2D CNNs (denoted by TEINet). The TEI module presents a different
paradigm to learn temporal features by decoupling the modeling of channel
correlation and temporal interaction. First, it contains a Motion Enhanced
Module (MEM), which enhances motion-related features while suppressing
irrelevant information (e.g., background). Then, it introduces a Temporal
Interaction Module (TIM) which supplements the temporal contextual information
in a channel-wise manner. This two-stage modeling scheme is not only able to
capture temporal structure flexibly and effectively, but also efficient for
model inference. We conduct extensive experiments to verify the effectiveness
of TEINet on several benchmarks (e.g., Something-Something V1&V2, Kinetics,
UCF101 and HMDB51). Our proposed TEINet achieves good recognition accuracy
on these datasets while still preserving high efficiency. Comment: Accepted by AAAI 2020
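A hedged sketch of the two-stage TEI idea on top of 2D-CNN frame features: a motion-based channel gate (MEM-like) followed by a channel-wise temporal convolution (TIM-like). This illustrates the decoupled scheme described above, not the official implementation.

```python
import torch
import torch.nn as nn

class TEIModule(nn.Module):
    """Illustrative sketch: MEM re-weights channels using adjacent-frame differences,
    TIM then mixes temporal context per channel with a depthwise temporal conv."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mem = nn.Sequential(                 # Motion Enhanced Module (channel gate)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.tim = nn.Conv1d(                     # Temporal Interaction Module
            channels, channels, kernel_size=3, padding=1, groups=channels
        )

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (B*T, C, H, W) frame-level features from a 2D CNN
        bt, c, h, w = x.shape
        b, t = bt // num_frames, num_frames
        feat = x.view(b, t, c, h, w)
        diff = feat[:, 1:] - feat[:, :-1]                        # adjacent-frame motion cue
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        gate = self.mem(self.pool(diff.reshape(bt, c, h, w)))    # (B*T, C, 1, 1)
        x = x * gate                                             # motion-enhanced features
        # channel-wise temporal interaction: convolve over the T axis per channel
        y = x.view(b, t, c, h * w).permute(0, 3, 2, 1).reshape(b * h * w, c, t)
        y = self.tim(y)
        y = y.reshape(b, h * w, c, t).permute(0, 3, 2, 1).reshape(bt, c, h, w)
        return y
```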